# 44 Least squares and linear modeling.

🚧🚧Need to add figures🚧🚧

## Least squares solutions.

Recall our humble beginning of solving the linear equation
$$
Ax = b
$$
for an unknown vector $x\in \mathbb R^{k}$, where $A$ is some $n\times k$ matrix and $b\in \mathbb R^{n}$ is some vector. And recall we have the following outcomes:
- $Ax = b$ is consistent, and
    - there is a unique solution, or
    - there are infinitely many solutions.
- $Ax=b$ is inconsistent, and there is no solution.

Before, the story ended there: when $Ax = b$ is inconsistent, we could not find any $x$ that makes $Ax$ equal to $b$. However, now we would like to ask: ok fine, if one cannot ever pick an $x$ such that $Ax$ equals $b$, can one pick $\hat x$ such that $A\hat x$ is _as close to $b$ as possible_? Such an $\hat x$ is called a **least squares solution.**

How can we find these least squares solutions? When $Ax = b$ is inconsistent, this happens because $b$ is not in the column space of $A$, so $Ax$ can never equal $b$. But by the best approximation theorem, we do know that the vector in the column space of $A$ that is closest to $b$ is the orthogonal projection of $b$ onto $CS(A)$. So there is some $A\hat x$ such that
$$
A\hat x = \text{proj}_{CS(A)}b
$$
Let us think more about how to solve for this $\hat x$. By orthogonal decomposition, we know that
$$
b-\text{proj}_{CS(A)}b \in CS(A)^{\perp}
$$
So we have
$$
b-A\hat x \in CS(A)^{\perp}
$$
But by the fundamental theorem of linear algebra, we know $CS(A)^{\perp} = NS(A^{T})$. Hence
$$
b-A\hat x \in NS(A^{T}),
$$
in other words,
$$
A^{T}(b-A\hat x) = \vec 0
$$
or
$$
A^{T}A\hat x = A^{T}b.
$$
This is called the **normal equation** to the linear equation $Ax=b$. In summary,

> **Least squares solution from the normal equation.**
> Given a linear equation $Ax = b$, the **least squares solutions** $\hat x$ satisfy the **normal equation** $$ A^{T}A\hat x = A^{T}b $$This equation is always consistent for $\hat x$ by construction, and it gives $\hat x$ such that $A\hat x$ is as close to $b$ as possible.

**Remark.** If $Ax=b$ happens to be consistent, then the least squares solutions from the normal equation are just the actual solutions to $Ax=b$. And yes, the word "normal" is being used everywhere. So be context aware.

**Remark.** To form the normal equation to $Ax=b$, just multiply both sides by $A^{T}$ on the left!

**Remark.** The process of finding the least squares solutions is sometimes called **the method of least squares**.

**Example.** Consider the linear equation
$$
\begin{pmatrix}1 & 5\\2 & 10\end{pmatrix} \vec x = \begin{pmatrix}1\\0\end{pmatrix}
$$
which is inconsistent. Find the least squares solutions to this system.

$\blacktriangleright$ Here $A=\begin{pmatrix}1&5\\2&10\end{pmatrix}$. Apply $A^{T}$ to both sides, and we get the normal equation
$$
\begin{align*}
\begin{pmatrix}1&2\\5&10\end{pmatrix}\begin{pmatrix}1&5\\2&10\end{pmatrix} \hat x = \begin{pmatrix}1&2\\5&10\end{pmatrix}\begin{pmatrix}1\\0\end{pmatrix} \\
\implies \begin{pmatrix}5 & 25\\25 & 125\end{pmatrix} \hat x = \begin{pmatrix}1\\5\end{pmatrix}
\end{align*}
$$
Row reduction gives
$$
\left[\begin{array}{cc|c}5 & 25 & 1\\25 & 125 & 5\end{array}\right]\stackrel{\text{row}}\sim \left[\begin{array}{cc|c}1 & 5 & \frac{1}{5}\\0 & 0 & 0\end{array}\right]
$$
So
$$
\hat x = \begin{pmatrix}-5\\1\end{pmatrix}t+\begin{pmatrix} \frac{1}{5}\\0\end{pmatrix}\quad \text{where \(t\) is free.}
$$
Here we found infinitely many least squares solutions $\hat x$. $\blacklozenge$
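To see this example numerically, here is a minimal sketch assuming NumPy is available; note that `np.linalg.lstsq` picks out just one member (the minimum-norm one) of the infinite family of least squares solutions we found by hand.

```python
import numpy as np

# The inconsistent system from the example above.
A = np.array([[1.0, 5.0],
              [2.0, 10.0]])
b = np.array([1.0, 0.0])

# Form the normal equation A^T A x_hat = A^T b.
AtA = A.T @ A          # [[  5.,  25.], [ 25., 125.]]
Atb = A.T @ b          # [1., 5.]

# np.linalg.lstsq returns one least squares solution (the minimum-norm one,
# i.e. one particular member of the infinite family found above).
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

# A @ x_hat should be the projection of b onto CS(A), namely (1/5, 2/5).
print(x_hat, A @ x_hat)
```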
I will make a remark here on when we expect a unique least squares solution, since as we saw the normal equation is always consistent and can have infinitely many least squares solutions:

> **Consistency and uniqueness of the normal equation.**
> Consider the linear equation $Ax = b$. The normal equation $$ A^{T}A\hat x = A^{T}b $$is always consistent, and it has a unique solution if and only if the columns of $A$ are linearly independent.

We will prove this result as a corollary of another lemma, which we do after an example.

**Example.** Find the least squares solution to the linear equation $Ax = b$ where
$$
A = \begin{pmatrix}2 & 1\\2 & 1\\2&2\end{pmatrix}, \quad b= \begin{pmatrix}0\\1\\1\end{pmatrix}
$$
Is the least squares solution unique?

$\blacktriangleright$ To find the least squares solution to $Ax = b$ we set up the normal equation
$$
\begin{align*}
A^{T}A\hat x & = A^{T}b \\
\implies \begin{pmatrix}2 & 2 & 2 \\ 1 & 1 & 2\end{pmatrix} \begin{pmatrix}2 & 1\\2 & 1\\2 & 2\end{pmatrix} \hat x & = \begin{pmatrix}2 & 2 & 2\\1 & 1 & 2\end{pmatrix} \begin{pmatrix}0\\1\\1\end{pmatrix}\\
\implies \begin{pmatrix}12 & 8 \\ 8 & 6\end{pmatrix} \hat x & = \begin{pmatrix}4\\3\end{pmatrix}
\end{align*}
$$
which we row reduce,
$$
\left[\begin{array}{cc|c} 12 & 8 & 4\\8 & 6 & 3 \end{array}\right] \stackrel{\text{row}}\sim \left[\begin{array}{cc|c} 1 & 0 & 0\\0 & 1 & 1 / 2 \end{array}\right]
$$
Hence the least squares solution is
$$
\hat x = \begin{pmatrix}0\\1 / 2\end{pmatrix}.
$$
Observe that here we get a unique least squares solution, and this agrees with the above proposition: since the columns of $A$ are linearly independent, the least squares solution is unique. $\blacklozenge$
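As a quick numerical sanity check of this example (a sketch assuming NumPy is available), we can solve the normal equation directly; since the columns of $A$ are linearly independent, $A^{T}A$ is invertible and `np.linalg.solve` returns the unique $\hat x$.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [2.0, 1.0],
              [2.0, 2.0]])
b = np.array([0.0, 1.0, 1.0])

# Columns of A are linearly independent, so A^T A is invertible and the
# normal equation A^T A x_hat = A^T b has exactly one solution.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(x_hat)   # [0.  0.5]
```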
Now, if the matrix $A^{T}A$ is invertible, then the normal equation $A^{T}A \hat x = A^{T}b$ has a unique solution. And the invertibility of $A^{T}A$ is tied to the linear independence of the columns of $A$ by the following lemma.

> **Lemma.**
> Let $A$ be any $n\times k$ real matrix. Then the columns of $A$ are linearly independent if and only if the columns of $A^{T}A$ are linearly independent.

$\blacktriangleright$ Proof. $(\implies)$ Suppose the columns of $A$ are linearly independent. This means the equation $Ax=\vec 0$ has only the trivial solution $x = \vec 0$. We wish to show $A^{T}A$ has linearly independent columns, which means we wish to show the equation $A^{T}A x = \vec 0$ has only the trivial solution $x = \vec 0$. So, if $A^{T}Ax = \vec 0$, then
$$
\begin{array}{lll}
& A^{T}A x = \vec 0 \\
\implies & x^{T}A^{T}Ax = x^{T}\vec 0 & \text{multiply both sides by \(x^T\) from the left} \\
\implies & (Ax)^{T}(Ax) = 0 & \text{property of transpose}\\
\implies & (Ax) \cdot (Ax) = 0 & \text{dot product interpretation}\\
\implies & Ax = \vec 0 & \text{positive-definiteness of dot product}\\
\implies & x = \vec 0 & \text{hypothesis that \(Ax=\vec 0\) has only trivial solution.}
\end{array}
$$
Hence we see that $A^{T}Ax = \vec 0$ has only the trivial solution $x = \vec 0$, so $A^{T}A$ has linearly independent columns.

$(\impliedby)$ Suppose the columns of $A^{T}A$ are linearly independent. This means the equation $A^{T}Ax = \vec 0$ has only the trivial solution $x = \vec 0$. We wish to show $A$ has linearly independent columns, which means we wish to show the equation $Ax=\vec 0$ has only the trivial solution $x = \vec 0$. So, if $Ax = \vec 0$, then
$$
\begin{array}{lll}
& Ax = \vec 0 \\
\implies & A^{T}Ax=A^{T}\vec 0 & \text{multiply both sides by \(A^T\) from the left} \\
\implies & A^{T}Ax = \vec 0 & \text{a matrix times the zero vector is the zero vector} \\
\implies & x = \vec 0 & \text{hypothesis that \(A^TAx = \vec0\) has only trivial solution.}
\end{array}
$$
Hence we see that $Ax = \vec0$ has only the trivial solution $x = \vec 0$, so $A$ has linearly independent columns. $\blacksquare$

Since $A^{T}A$ is a square $k\times k$ matrix, its columns being linearly independent is the same as $A^{T}A$ being invertible, and this gives the uniqueness claim in the proposition above.

## Least-squares line: Line of best fit.

The method of least squares can be used to find the "best fit" linear combination of things. Let us find the "line of best fit" using the method of least squares.

**Example.** Consider the following data points
$$
\begin{array}{c|cccc}
x & 0 & 1 & 2 & 3 \\ \hline
y & 2 & 2 & 1 & 0
\end{array}
$$
Find a line of best fit of the form $y = mx + b$ for our data, using the method of least squares.

By sketching the data points on a plane, one can see that there is no line that would go through every single point. That is to say, there is no choice of $m,b$ such that the following linear system is consistent:
$$
\begin{array}{}
2 = m(0) + b \\
2 = m(1) + b \\
1 = m(2) + b \\
0 = m(3) + b
\end{array}
\iff
\underbrace{\begin{bmatrix}0 & 1\\1 & 1\\2 & 1\\3 & 1\end{bmatrix} }_{X}\underbrace{\begin{bmatrix}m\\b\end{bmatrix}}_{\beta} = \underbrace{\begin{bmatrix}2\\2\\1\\0\end{bmatrix}}_{y}
$$
For some terminology moving forward, let us call this matrix $X$ the **design matrix**, the vector $\beta$ the **parameter vector**, and the vector $y$ the **observation vector**, consisting of the $y$-values of the data points.

This linear equation $X\beta = y$ is inconsistent. What we can do, however, is solve for the least squares solutions. To find the least squares solution, we set up the normal equation $X^{T}X\hat\beta = X^{T}y$,
$$
\begin{array}{l}
\begin{bmatrix}0 & 1 & 2 & 3\\ 1 & 1 & 1 & 1\end{bmatrix} \begin{bmatrix}0 & 1\\1 & 1\\2 & 1\\3 & 1\end{bmatrix} \begin{bmatrix}\hat m\\\hat b\end{bmatrix} = \begin{bmatrix}0 & 1 & 2 & 3\\ 1 & 1 & 1 & 1\end{bmatrix} \begin{bmatrix}2\\2\\1\\0\end{bmatrix} \\
\implies \begin{bmatrix}14 & 6\\6 & 4\end{bmatrix} \begin{bmatrix}\hat m\\ \hat b\end{bmatrix} = \begin{bmatrix}4\\5\end{bmatrix} \\
\end{array}
$$
And we row reduce
$$
\left[\begin{array}{cc|c} 14 & 6 & 4 \\ 6 & 4 & 5 \end{array}\right] \stackrel{\text{row}}\sim \left[\begin{array}{cc|c} 1 & 0 & -\frac{7}{10} \\ 0 & 1 & \frac{23}{10} \end{array}\right]
$$
This gives the parameter vector
$$
\hat\beta = \begin{bmatrix}\hat m\\\hat b\end{bmatrix} = \begin{bmatrix}-\frac{7}{10} \\ \frac{23}{10}\end{bmatrix}
$$
So the best fit line to our data is
$$
\hat y = - \frac{7}{10} x + \frac{23}{10}
$$
Let us plot it
![[smc-spring-2024-math-13/linear-algebra-notes/---files/best-fit-line.png]]

The error between the actual $y$-value of a data point $(x,y)$ and the value predicted by the best fit line $\hat y = \hat m x + \hat b$ is called the **residual**,

> If $(x_{i},y_{i})$ is a data point, and $\hat y_{i}$ is the value predicted by a model at $x_{i}$, then the difference $$ r_{i} = y_{i}-\hat y_{i} $$is called the residual of the $i$-th data point.

This is the (signed) length of the red line segment.
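If you would like to verify this fit numerically, here is a minimal sketch assuming NumPy is available; it solves the normal equation for this data and then computes the residuals just defined.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 1.0, 0.0])

# Design matrix: a column of x-values (slope) and a column of ones (intercept).
X = np.column_stack([x, np.ones_like(x)])

# Solve the normal equation X^T X beta_hat = X^T y.
m_hat, b_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(m_hat, b_hat)            # -0.7  2.3

# Residuals r_i = y_i - y_hat_i, and their sum of squares.
residuals = y - (m_hat * x + b_hat)
print(residuals, np.sum(residuals**2))
```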
Since we found $\begin{bmatrix}\hat m\\\hat b\end{bmatrix}$ such that the distance between
$$
\begin{bmatrix}\hat y_{1}\\\hat y_{2}\\\vdots \\ \hat y_{n}\end{bmatrix} \text{ and } \begin{bmatrix}y_{1}\\y_{2}\\\vdots \\y_{n}\end{bmatrix}
$$
is minimized, the sum
$$
(y_{1}-\hat y_{1})^{2}+(y_{2}-\hat y_{2})^{2}+\cdots +(y_{n}-\hat y_{n})^{2}
$$
is minimized as well (it is the square of that distance). In other words,

> Applying the method of least squares on a set of data points $$ (x_{1},y_{1}),\ldots,(x_{n},y_{n}) $$to produce a line of best fit $\hat y = \hat m x + \hat b$ gives a line that minimizes the sum of the squares of the residuals.

This could be useful in statistics, one can only guess.

## Least-squares fitting of general curves.

The method of least squares is not limited to finding a "line of best fit". More generally, it can be used to find a model of best fit, where the model is a linear combination of functions with yet-to-be-determined coefficients (parameters), that minimizes the sum of the squares of the residuals:

> **Least-squares fitting of general curves.**
> Given a set of $n$ data points $(x_{i},y_{i})$, one can find a model of best fit of the form $$ \hat y = \hat \beta_{1} f_{1}(x) + \hat \beta_{2} f_{2}(x)+\cdots + \hat \beta_{k} f_{k}(x) $$such that the sum of squares of the residuals $$ \sum (y_{i}-\hat y_{i})^{2} $$is minimized, by solving the (possibly inconsistent) linear system $$ \begin{array}{} y_{1} = \beta_{1} f_{1}(x_{1})+ \cdots +\beta_{k}f_{k}(x_{1}) \\ y_{2} = \beta_{1} f_{1}(x_{2})+ \cdots +\beta_{k}f_{k}(x_{2}) \\ \vdots \\ y_{n} = \beta_{1}f_{1}(x_{n}) + \cdots +\beta_{k}f_{k}(x_{n}) \end{array} $$equivalently the linear equation $$ \underbrace{\begin{bmatrix}f_{1}(x_{1}) & f_{2}(x_{1}) & \cdots & f_{k}(x_{1}) \\ f_{1}(x_{2}) & f_{2}(x_{2}) & \cdots & f_{k}(x_{2}) \\ & &\vdots\\ f_{1}(x_{n}) & f_{2}(x_{n}) & \cdots & f_{k}(x_{n}) \\ \end{bmatrix} }_{X}\underbrace{\begin{bmatrix}\beta_{1}\\\beta_{2}\\\vdots\\\beta_{k}\end{bmatrix}}_{\beta} = \underbrace{\begin{bmatrix}y_{1}\\y_{2}\\\vdots\\y_{n}\end{bmatrix}}_{y} $$Here $X$ is the design matrix, $\beta$ is the parameter vector, and $y$ is the observation vector. We can find the least-squares solution to the linear system $X\beta = y$ by setting up the normal equation $X^{T}X\hat \beta = X^{T}y$.

This may look daunting, but I suggest you follow your nose to set it up (there is a short code sketch below that packages the same steps):
- First write down the form of the model you are trying to fit your data with.
- Make sure your model is a linear combination of functions with yet-to-be-determined parameters $\beta_{i}$.
- Substitute the data points $(x_{i},y_{i})$ to get a system of linear equations, which you can write as $X\beta = y$.
- This system is very likely to be inconsistent, but regardless we can always solve for the least squares solutions $\hat\beta$ by solving the normal equation $X^{T}X\hat \beta = X^{T}y$.
- Put the parameters $\hat\beta_{i}$ from $\hat \beta$ back into your model; this is now a model that minimizes the sum of squares of the residuals!

**Remark.** Here I am loosely using the $\hat{\cdot}$ notation to denote a vector or entry that is obtained from the least-squares method.
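To make the recipe above concrete, here is a small sketch assuming NumPy is available; the helper name `fit_basis` is mine (not a standard function) and simply packages the steps in the list above.

```python
import numpy as np

def fit_basis(x, y, basis):
    """Least-squares fit of y ~ beta_1*f_1(x) + ... + beta_k*f_k(x).

    `basis` is a list of functions f_j; the design matrix X has columns
    f_j(x_i).  Returns the least squares parameter vector beta_hat.
    """
    X = np.column_stack([f(x) for f in basis])
    # lstsq also handles linearly dependent columns; when the columns are
    # independent it agrees with solving the normal equation X^T X b = X^T y.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# Usage for the model in the next example, y = b1 + b2*x^2 + b3*(1/x):
#   basis = [lambda t: np.ones_like(t), lambda t: t**2, lambda t: 1 / t]
#   beta_hat = fit_basis(x_data, y_data, basis)
```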
**Example.** (Numerical) Using the method of least squares, find a model of the form
$$
\hat y = \hat \beta_{1} + \hat \beta_{2} x^{2} + \hat \beta_{3} \frac{1}{x}
$$
that best fits the following data and minimizes the sum of the squares of the residuals
$$
\begin{array}{c|ccccc}
x & 1 & 2 & 3 & 4 & 5\\ \hline
y & 3 & 1 & 3 & 4 & 6
\end{array}
$$
Approximate the parameters to 1 decimal place.

$\blacktriangleright$ Applying the model to the data points gives the following system of equations
$$
\begin{array}{}
3 = \beta_{1} +\beta_{2}(1)^{2} + \beta_{3} \frac{1}{1} \\
1 = \beta_{1} +\beta_{2}(2)^{2} + \beta_{3} \frac{1}{2} \\
3 = \beta_{1} +\beta_{2}(3)^{2} + \beta_{3} \frac{1}{3} \\
4 = \beta_{1} +\beta_{2}(4)^{2} + \beta_{3} \frac{1}{4} \\
6 = \beta_{1} +\beta_{2}(5)^{2} + \beta_{3} \frac{1}{5}
\end{array}
\iff
\underbrace{\begin{bmatrix}1 & 1 & \frac{1}{1} \\ 1 &4 & \frac{1}{2} \\ 1&9& \frac{1}{3} \\ 1&16& \frac{1}{4} \\ 1&25& \frac{1}{5}\end{bmatrix}}_{X}\underbrace{\begin{pmatrix}\beta_{1}\\\beta_{2}\\\beta_{3}\end{pmatrix}}_{\beta} =\underbrace{ \begin{pmatrix}3\\1\\3\\4\\6\end{pmatrix}}_{y}.
$$
This gives a linear system $X\beta =y$, with design matrix $X$, parameter vector $\beta$, and observation vector $y$. We find a least-squares solution to $X\beta = y$ by solving the normal equation $X^{T}X\hat\beta = X^{T}y$, which gives
$$
\begin{bmatrix}1 & 1 & 1 & 1 & 1\\ 1 & 4 & 9 & 16&25\\ \frac{1}{1} & \frac{1}{2} &\frac{1}{3} & \frac{1}{4}& \frac{1}{5}\end{bmatrix}\begin{bmatrix}1 & 1 & \frac{1}{1} \\ 1 &4 & \frac{1}{2} \\ 1&9& \frac{1}{3} \\ 1&16& \frac{1}{4} \\ 1&25& \frac{1}{5}\end{bmatrix} \begin{pmatrix}\hat\beta_1\\\hat\beta_{2}\\\hat\beta_{3}\end{pmatrix} = \begin{bmatrix}1 & 1 & 1 & 1 & 1\\ 1 & 4 & 9 & 16&25\\ \frac{1}{1} & \frac{1}{2} &\frac{1}{3} & \frac{1}{4}& \frac{1}{5}\end{bmatrix} \begin{pmatrix}3\\1\\3\\4\\6\end{pmatrix}
$$
$$
\begin{bmatrix}5 & 55 & \frac{137}{60} \\ 55 & 979 & 15 \\ \frac{137}{60} & 15 & \frac{5269}{3600}\end{bmatrix}\begin{pmatrix}\hat\beta_1\\\hat\beta_{2}\\\hat\beta_{3}\end{pmatrix} = \begin{pmatrix}17\\248\\ \frac{67}{10}\end{pmatrix}
$$
Solving this, we get approximately
$$
\hat\beta_{1} \approx -1.4 , \quad \hat\beta_{2} \approx 0.3, \quad \hat\beta_{3} \approx 4.0
$$
So the least-squares curve of the given form for the data is (approximately)
$$
\hat y = -1.4 + 0.3 x^{2} + 4.0 \left(\frac{1}{x}\right).
$$
Let us plot this least-squares curve against the data
![[smc-spring-2024-math-13/linear-algebra-notes/---files/Pasted image 20240524121818.png]]
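As a numerical cross-check of this example, here is a minimal sketch assuming NumPy is available; it builds the same design matrix and solves the normal equation.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 1.0, 3.0, 4.0, 6.0])

# Design matrix for the model y = b1 + b2*x^2 + b3*(1/x).
X = np.column_stack([np.ones_like(x), x**2, 1 / x])

# Solve the normal equation X^T X beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 1))   # approximately [-1.4  0.3  4. ]
```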